This document details several tests of simulation validity/performance conducted using real datasets. Data/articles used for testing can all be found in the “References” section below and at https://github.com/E-Y-M/poweROC/tree/main/Dataset%20testing%20and%20reports, and were obtained from the Open Science Framework. If you have ROC data (along with analysis parameters) you are willing to share for the purposes of simulation testing, feel free to email me at ericmah@uvic.ca. Issues/comments on the app or simulation testing results can be posted on GitHub at https://github.com/E-Y-M/poweROC/issues.
At a basic level, simulation validity depends on the ability of the simulations to recover AUC values close to those in the original dataset. The figure below depicts original AUC estimates from various papers with open data (5 papers, 8 experiments, 22 ROC curves computed using the same N’s/pAUC cutoffs in the original papers), along with simulated estimates and intervals:
Testing the ability of the simulation to recover AUC values from experiments. Open circles represent original AUC values, all other points represent simulation estimates under various conditions (“NSims” = Number of simulated datasets per effect size/N, “NBootIter” = Number of bootstrap iterations per AUC comparison). Error bars = 95% quantiles on the mean estimated AUC for the simulations. Overall, simulations demonstrate excellent ability to recover original AUC values, even under default settings (NSims = 100, NBootIter = 1000).
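As a rough illustration of the recovery check (a Python sketch of the general approach, not the app's actual R implementation), the core loop resamples the base confidence data NSims times, computes each simulated ROC, and summarizes the AUCs with a mean and 95% quantile interval. The data format, the `roc_points` helper, and the use of full trapezoidal AUC (the app uses pAUC with cutoffs) are simplifying assumptions of mine:

```python
import random
from statistics import mean

def roc_points(target_conf, lure_conf, criteria):
    """Cumulative ROC points: at each confidence criterion c (high to low),
    the proportion of lure / target trials with confidence >= c."""
    pts = [(0.0, 0.0)]
    for c in sorted(criteria, reverse=True):
        far = sum(x >= c for x in lure_conf) / len(lure_conf)
        hr = sum(x >= c for x in target_conf) / len(target_conf)
        pts.append((far, hr))
    return pts

def trapezoid_auc(pts):
    """Trapezoidal area under the (false-ID rate, correct-ID rate) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def simulate_auc(base_target, base_lure, criteria, n_sims=100, seed=1):
    """Resample the base data n_sims times; return the mean AUC and the
    95% quantile interval across the simulated datasets."""
    rng = random.Random(seed)
    aucs = sorted(
        trapezoid_auc(roc_points(rng.choices(base_target, k=len(base_target)),
                                 rng.choices(base_lure, k=len(base_lure)),
                                 criteria))
        for _ in range(n_sims)
    )
    lo = aucs[int(0.025 * n_sims)]
    hi = aucs[min(int(0.975 * n_sims), n_sims - 1)]
    return mean(aucs), (lo, hi)
```

Recovery is then a matter of checking that the original AUC falls near the simulated mean and inside the quantile interval.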
Still, the question remains as to whether increasing the number of simulations or bootstrap iterations increases the precision of the estimates. The figure below shows the width of the 95% quantile intervals for the AUC estimates above as a function of the simulation conditions.
Based on these simulations, increasing the number of simulations or bootstrap iterations does not substantially increase precision beyond the default settings, suggesting that the defaults yield reasonably precise estimates.
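A small sketch of why more simulations need not shrink these intervals: the 95% quantile interval estimates the spread of the sampling distribution of AUC, so increasing NSims estimates that width more stably but does not reduce it (only larger trial counts would). The Gaussian stand-in for simulated AUC estimates below is my assumption, used purely for illustration:

```python
import random

def quantile_interval_width(values, lo_q=0.025, hi_q=0.975):
    """Width of the central 95% quantile interval of a set of estimates."""
    vals = sorted(values)
    n = len(vals)
    return vals[min(int(hi_q * n), n - 1)] - vals[int(lo_q * n)]

# Stand-in for simulated AUC estimates: Gaussian noise around a "true" AUC of .75.
rng = random.Random(0)
width_100 = quantile_interval_width([rng.gauss(0.75, 0.02) for _ in range(100)])
width_500 = quantile_interval_width([rng.gauss(0.75, 0.02) for _ in range(500)])
# Both widths hover near 2 * 1.96 * 0.02, i.e. roughly 0.08; the extra
# simulations refine the interval rather than narrow it.
```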
It is less clear whether the default simulation settings result in the most accurate power estimates. To examine this, I simulated power for 13 ROC comparisons from the papers above. I also conducted two simulation runs on a dataset with a prespecified null effect (using the “Medium Similarity” condition from Colloff et al., 2021a, as a base) to compare against the nominal Type I Error Rate of .05, all under the three different simulation conditions. These power estimates are plotted below:
Power estimates differed slightly across the simulation conditions, but no clear patterns emerged. In these examples, the maximum range of estimated power was .10. Importantly, power estimates in the two null-effect simulations were close to the nominal Type I Error Rate of .05 (though the non-default settings produced slightly higher estimates).
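The power estimates above are, in essence, rejection proportions. A generic sketch of that logic (mine, not the app's code): run the test on each simulated dataset and count rejections at alpha; the same machinery applied to a null-effect generator estimates the Type I error rate. The uniform and skewed p-value generators below are toy stand-ins for the actual bootstrap pAUC comparison:

```python
import random

def estimate_power(simulate_pvalue, n_sims=100, alpha=0.05, seed=1):
    """Power = proportion of simulated datasets whose test rejects at alpha.
    Fed a null-effect generator, this proportion estimates the Type I error rate."""
    rng = random.Random(seed)
    return sum(simulate_pvalue(rng) < alpha for _ in range(n_sims)) / n_sims

# Under a true null, p-values are roughly uniform, so rejections run near alpha.
type1 = estimate_power(lambda rng: rng.random(), n_sims=2000)

# Under a true effect, p-values pile up near zero, so power exceeds alpha.
power = estimate_power(lambda rng: rng.random() ** 4, n_sims=2000)
```

With NSims = 100, each power estimate carries binomial noise of roughly ±.02 to ±.05, which is consistent with the run-to-run variation seen above.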
Finally, I examined the behaviour of the different simulation settings in a full simulation example (i.e., one involving multiple sample sizes). I simulated power for 5 effect sizes and 3 sample sizes (1000, 3000, 5000), again using the “Medium Similarity” condition data from Colloff et al. (2021a) as a base. For each simulation setting, I ran two simulations to get a basic idea of run-to-run consistency. First, the hypothetical ROCs tested in this analysis:
Next, power curves for these simulations:
These simulations produce the same general expected pattern, but a few things are worth noting. First, the default settings (though the fastest to run) show considerable run-to-run variability, including one violation of power simulation expectations (i.e., higher power for a smaller effect size at the same sample size). Between increasing the bootstrap iterations and increasing the number of simulations, increasing the number of simulations yields better run-to-run consistency while maintaining the expected pattern of results, at the cost of longer simulation time. At least in these examples, raising the default values of both NSims and NBootIter offered no substantial benefit over increasing NSims alone, and increasing NSims beyond 200 did not produce substantial further gains.
I tested the ability of the app to recover published DPP values (from Smith et al., 2019, who used the “concealment” and “nothing” condition data from Colloff et al., 2016). The DPP values reported in Smith et al. (2019) were .86 and .82, respectively, with a DPP difference of .04 (95% CI [.007, .087]). In an initial simulation run using the same data and sample size (with 100 simulated samples and 2200 bootstrap iterations per DPP test), the simulated DPPs were .87 and .82, with an estimated DPP difference of .05 (95% CI [-.008, .10]). In a second simulation run, the simulated DPPs were again .87 and .82, with an estimated DPP difference of .05 (95% CI [-.006, .11]). Aside from some discrepancies in the DPP difference CIs due to slightly different calculation methods, powe(R)OC recovered the DPP values accurately and consistently across simulation runs.
In a test of long-run Type I error rates across two simulations using base data with a null effect, the DPP Type I error rates were .04 and .03 (slightly outperforming pAUC, which had a Type I error rate of .07 in both cases).
Finally, I conducted two runs of a full power simulation with both AUC and DPP. The hypothetical ROC curves were again constructed using the “Medium Similarity” condition data from Colloff et al. (2021a) as a base:
The resulting power curves for Run 1:
…and for Run 2:
Again, power estimates differed slightly across runs (mostly at smaller sample sizes), but results were consistent overall. Interestingly, in this case power to detect the differences was substantially higher for DPP than for pAUC. I hesitate to draw general conclusions from a single simulation (and, in the validation using the data from Colloff et al., 2016, pAUC held a power advantage over DPP). However, this does suggest that power can differ substantially depending on the measure used, and points to further testing of when one measure provides more power than the other.
powe(R)OC offers two ways for users to simulate data: 1) resampling new data using the confidence-response proportions from the original dataset, or 2) generating new data from an sdtlu model fit to the data. The sections below detail some initial comparisons of these methods.
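The two generation strategies can be sketched as follows. This is an illustrative Python simplification, not the app's code: method 1 draws new responses in proportion to the observed confidence-response counts, and method 2 draws from a basic equal-variance Gaussian signal detection model (sdtlu fits a fuller, lineup-specific model; the function names and parameters here are my own):

```python
import random

def resample_from_proportions(counts, n_trials, rng):
    """Method 1 sketch: draw new responses in proportion to the observed
    confidence-response counts in the base data."""
    cats = list(counts)
    return rng.choices(cats, weights=[counts[c] for c in cats], k=n_trials)

def generate_from_sdt(d_prime, criteria, n_trials, rng, target=True):
    """Method 2 sketch: draw memory strengths from a Gaussian SDT model and
    bin them into confidence levels at the criteria (equal-variance
    simplification of the model-based approach)."""
    mu = d_prime if target else 0.0
    out = []
    for _ in range(n_trials):
        strength = rng.gauss(mu, 1.0)
        out.append(sum(strength > c for c in criteria))  # level 0..len(criteria)
    return out
```

The key practical difference: method 1 can only reproduce response patterns present in the base data, while method 2 smooths over them through the fitted model's parameters.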
First, I examined and compared the degree of variability in ROCs generated via the two methods available in the app. This was done for two datasets: the Colloff et al. (2021a) Experiment 2 High- vs. Low-similarity fillers, and the Palmer et al. (2013) Experiment 1 Short- vs. Long-delay conditions. For each dataset, I generated new datasets with 500, 3000, and 6000 lineup trials (via resampling), and then for each of the 6 new datasets I simulated 50 ROC curves using each method (data resampling vs. sdtlu). Results are shown below, with the simulated ROC curves plotted against the ROC curve in the base data:
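The variability comparison for the resampling method can be sketched as follows (a simplified stand-in with hypothetical helper names; the app plots full ROC curves rather than per-criterion summaries): simulate many curves from the base data and summarize the spread of the cumulative response rate at each criterion, which should shrink as the number of lineup trials grows:

```python
import random
from statistics import pstdev

def rate_at_criterion(conf, c):
    """Cumulative response rate: proportion of trials with confidence >= c."""
    return sum(x >= c for x in conf) / len(conf)

def roc_spread(base_conf, criteria, n_curves=50, seed=0):
    """Resample the base confidence data n_curves times and report, per
    criterion, the SD of the cumulative rate across the simulated curves."""
    rng = random.Random(seed)
    rates = {c: [] for c in criteria}
    for _ in range(n_curves):
        sample = rng.choices(base_conf, k=len(base_conf))
        for c in criteria:
            rates[c].append(rate_at_criterion(sample, c))
    return {c: pstdev(rates[c]) for c in criteria}
```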